Parsing structured data

ABSTRACT

A method for parsing structured data has the steps of: receiving input data in a first computer language; generating a plurality of tokens according to the input data; building a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one or more chains of context steps, the step of building the context comprising the sub-steps of: detecting if according to the grammar syntax a token is allowable in the context; and if the token is allowable, creating a new context step corresponding to the token, and the further steps for recovering an unallowable token: identifying a suitable context for the unallowable token in which context the token is allowable; and applying the token in the identified suitable context.

[0001] The invention relates to document parsing, in particular to grammatical parsing of documents.

[0002] The Internet has brought into being a number of new applications for various purposes. Underlying the rapid growth of the Internet is the HyperText Markup Language (HTML) standard for definition of documents in digital form. HTML is an application of Standard Generalized Markup Language (SGML). Derived from SGML there is another, rapidly growing family of definitions, called Extensible Markup Language (XML). Furthermore, there is Wireless Markup Language (WML), which is especially designed for use in mobile communications. WML is an application of XML, whereas HTML is an application of SGML.

[0003] HTML is used for an enormous number of documents published on the Internet. These documents are usually available to the public and provide a highly diverse source of information. The information in digital form is often referred to as content.

[0004] Unlike HTML, WML is designed particularly for wireless terminals. The amount of content in the form of WML is, as yet, very limited compared to that in the form of HTML. A wireless terminal supporting only WML (a WML terminal) cannot use the content in HTML.

[0005] In order to make interesting HTML-formatted content available for use in WML terminals there are two options. Firstly, the content documents can be re-written in WML. Secondly, a network relaying HTML content to a WML terminal can perform an automatic conversion from HTML to WML when the terminal requests such content. This can be arranged by using, between the WML terminal and the Internet, a gateway server that has the capability of converting content from HTML to WML.

[0006] Both HTML and XML are under constant development. HTML is converging towards XML; that is, HTML is becoming one instance of the XML language family.

[0007] The generation and reconstruction of HTML and XML documents is next described. A document is first broken up so that its formatting and meaning (actual content) are stored separately in different mark-up tags, or simply tags. In HTML documents, the tags are in a sequence and hence they are sequentially transmitted. Some tags contain structural information for defining the structure of a document the content defines, whilst some other tags contain the meaning in clips of information to be output to a user according to the defined structure. Typically, these clips are text.

[0008] A terminal receiving HTML, XML, or WML content receives a series of tokens. For reconstructing a document, or content, in a mark-up language, an assembler is required. The assembler puts back together the formatting and the meaning of the document. In the core of the assembler there is a parser that parses the data according to certain parse rules and a parsing language, that is, a grammar. The parser is typically a program controlling a processor of the terminal. The parser receives input in the form of sequential mark-up tags (interleaved with character data tokens) and breaks the input up into parts (for example, the nouns (objects), verbs (methods), and their attributes or options) that can then be managed by other software components. The parser may also check that all necessary input has been provided. In this context, the parser breaks the input into tokens and builds the structure according to the tokens. The tokens are typically a parser's internal representation of tags or textual data (character data tokens).
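The tokenising described above can be illustrated with a minimal sketch. It is not taken from the specification: the `Token` type, the `tokenise` function and the regular expression are assumptions made for illustration, and the sketch ignores attributes, comments and entities.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # "start", "end" or "text" (hypothetical token kinds)
    value: str  # upper-case tag name, or the character data itself


# A tag is "<", an optional "/", a name, optional attributes, then ">".
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def tokenise(markup: str) -> Iterator[Token]:
    """Yield tag tokens interleaved with character data tokens."""
    pos = 0
    for match in _TAG.finditer(markup):
        # Anything between two tags is character data.
        text = markup[pos:match.start()].strip()
        if text:
            yield Token("text", text)
        kind = "end" if match.group(1) else "start"
        yield Token(kind, match.group(2).upper())
        pos = match.end()
    tail = markup[pos:].strip()
    if tail:
        yield Token("text", tail)
```

Because `tokenise` is a generator, a consumer can start working on the first tokens before the whole input has been scanned.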

[0009] A parser is needed for a plurality of different HTML or XML processing applications, such as gateways, HTML browsers, mobile terminals, authoring tools, and on some occasions, web servers.

[0010] Unfortunately, the constant development of HTML and XML results in a need to frequently update the equipment used for conversion between these languages in order to deal with new documents. Therefore, the parser contained in the equipment should be updated frequently to cope with different dialects of the languages. This has usually been carried out by building a new version of the assembler whenever required. For this purpose there are at least two tool programs, namely “yacc” (Yet Another Compiler Compiler) and “lex”. While these tool programs greatly facilitate building of the new parser, a syntax of the language to be parsed needs first to be described in a dedicated language. The syntax defines how the parsing should be carried out. When the syntax description (in the dedicated language) is ready, it is processed with filtering tools to generate a source code representation of the parser. The filtering tools typically generate C programming language source code. The produced parser is a monolithic piece of software, which combines the syntax rules and the parser logic. Finally, the generated parser source code is compiled and linked with the application code to produce an executable program with parser functionality. The drawback of this procedure is the large amount of labour required to adapt the equipment to changes in the input language (HTML, XML).

[0011] Typically, modern XML documents identify a grammatical definition that is to be retrieved from a network, and thus a dynamically replaceable grammatical definition is required. Typically, a particular reference in the document is used for the identification of the grammar definition.

[0012] U.S. Pat. No. 5,687,378 provides an alternative procedure that allows the dedicated syntax description language to be changed without recompiling or re-linking the assembler. This is based on the use of switchable syntax modules each comprising a different set of parse rules. The parse rules define the grammar used by the parser. In this way, the actual parser engine is separated from the rules used and the rules can be easily changed. While the parsing rules can thus be changed and the adaptation to new description languages or dialects has become easier, certain problems remain.

[0013] The parsers generated with standard tools become rather complex and memory consuming, since they provide complex syntax description languages for covering more descriptive languages than XML. In the case of XML, a less descriptive syntax description language would suffice. If the parser were optimised for XML, it would be smaller. Furthermore, a great number of the pages present on the Internet are deficient. The defects in these pages hinder and/or slow down the parsing. It is also known that some content providers may deliberately generate certain errors in order to prevent or harm the use of certain applications, such as certain WWW browsers.

[0014] According to a first aspect of the invention, there is provided a method for parsing structured data comprising the steps of:

[0015] receiving input data in a first computer language;

[0016] generating a plurality of tokens according to the input data;

[0017] building a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one or more chains of context steps, the step of building the context comprising the sub-steps of:

[0018] detecting if according to the grammar syntax a token is allowable in the context; and

[0019] if the token is allowable, creating a new context step corresponding to the token;

[0020] characterised in that the method comprises the further steps for recovering an unallowable token:

[0021] identifying a suitable context for the unallowable token in which context the token is allowable; and

[0022] applying the token in the identified suitable context.

[0023] Advantageously, the method makes use of all the correct information of partly erroneous data and tries to recover from errors so that even erroneously presented information can be used. The parsing, and the error recovery as a part of it, is fast and requires a reduced amount of computing and memory resources.

[0024] Preferably, the identifying of the suitable context comprises searching for a least different allowable context for the token using the grammar syntax.

[0025] Advantageously, the searching for the least different allowable context minimises the (erroneous) change in the structure of the information on recovery.

[0026] Preferably, the applying of the token comprises modifying the context to be equal to the least different allowable context.

[0027] Preferably, the searching for the least different allowable context attempts to generate a new context step according to a rule specific to the unallowable token to complement the context so that the unallowable token becomes allowable. In this way, a forgotten or deliberately dropped token can automatically be inserted in order to estimate the likely context.

[0028] Preferably, the searching for the least different allowable context proceeds backwards along the chain of context steps one by one until an allowable context is found. In this way, a forgotten or purposely dropped end token can be overcome.

[0029] Preferably, the method first attempts to complement the context by adding a context step that is allowable, according to the syntax, both to the preceding context step and to the unallowable context step. Preferably, if such complementing fails, the backwards proceeding searching is performed.

[0030] Preferably, after each step backwards along the context chain, the method attempts again to complement the context to conform to the token. In this way, it is possible to overcome the lack of an end token or another token. In this way, the structure of the data is changed least, because in a context of structured data, the last steps have the least significance. It is most probable that the first allowable context step in reverse order in the context chain is correct for the token that is not allowable in the context in which it was to be used.
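The ordering of the recovery attempts in the preceding paragraphs (complement first, then step backwards, then try to complement again) can be sketched as follows. This is an illustrative sketch only: the `ALLOWED` table is a hypothetical, heavily simplified grammar, and the function names are assumptions rather than anything defined in this specification.

```python
from typing import Dict, List, Optional, Set

# Hypothetical grammar: parent token -> tokens allowed directly beneath it.
ALLOWED: Dict[str, Set[str]] = {
    "ROOT": {"HTML"},
    "HTML": {"HEAD", "BODY"},
    "BODY": {"H1", "OL", "IMG"},
    "OL": {"LI"},
    "LI": {"IMG"},
}


def allowable(parent: str, token: str) -> bool:
    """Check whether token may appear directly under parent."""
    return token in ALLOWED.get(parent, set())


def complement(parent: str, token: str) -> Optional[str]:
    """Find one step X such that parent -> X -> token is allowable."""
    for candidate in sorted(ALLOWED.get(parent, set())):
        if allowable(candidate, token):
            return candidate
    return None


def apply_token(context: List[str], token: str) -> List[str]:
    """Return the context chain after applying token, recovering if needed."""
    chain = list(context)
    while chain:
        if allowable(chain[-1], token):
            return chain + [token]
        extra = complement(chain[-1], token)
        if extra is not None:
            # A missing step is inserted so that the token becomes allowable.
            return chain + [extra, token]
        chain.pop()  # step backwards along the chain and try again
    raise ValueError(f"no allowable context for {token}")
```

With this grammar, a list item arriving directly under BODY gains an inserted OL step, whereas a list item arriving inside an unclosed list item causes the chain to step back to the enclosing OL, which has the effect of closing the earlier item.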

[0031] Preferably, a rule specific for the generic syntax of the first computer language is used to determine whether a token is allowable in a context. This allows detection of such errors that relate to the generic type of the first computer language.

[0032] Preferably, a rule specific for a type of token is used in the searching step to determine whether a token is allowable in a context. This also allows recovery from such errors that cannot be recovered by using a rule specific for the generic syntax of the first computer language.

[0033] Preferably, both syntax and type of token specific recovering methods are used for recovering an unallowable token.

[0034] Preferably, the method comprises a further step of dynamically switching to use another grammar syntax to adapt to a different language or to a different dialect of the language.

[0035] According to a second aspect of the invention there is provided a parser comprising:

[0036] an input for receiving input data in a first computer language;

[0037] a tokeniser for generating a plurality of tokens according to the input data;

[0038] a context builder for building a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one chain of context steps, the context builder being configured to:

[0039] detect if according to the grammar syntax a token is allowable in the context; and

[0040] if the token is allowable, to create a new context step corresponding to the token;

[0041] characterised in that

[0042] the parser further comprises for recovering an unallowable token:

[0043] an identifying block for identifying for the unallowable token a suitable context in which the token is allowable; and

[0044] the tokeniser is configured to apply the token in the identified suitable context.

[0045] According to a third aspect of the invention there is provided a processing unit having a parser comprising:

[0046] an input for receiving input data in a first computer language;

[0047] a processor for generating a plurality of tokens according to the input data;

[0048] a context builder for building a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one chain of context steps, configured to:

[0049] detect if according to the grammar syntax a token is allowable in the context; and

[0050] if the token is allowable, to create a new context step corresponding to the token;

[0051] characterised in that the parser further comprises for recovering an unallowable token:

[0052] an identifying block for identifying for the unallowable token a suitable context in which the token is allowable; and

[0053] the processor is configured to apply the token in the identified suitable context.

[0054] Preferably, the processing unit is a device selected from a group consisting of: a translator, a gateway, a mobile station and a web server. The device has the advantage of being able to recover from some errors in the input data.

[0055] According to a fourth aspect of the invention there is provided a computer program product for controlling a parser comprising:

[0056] parser executable computer program code to enable the parser to receive input data in a first computer language;

[0057] parser executable computer program code to enable the parser to generate a plurality of tokens according to the input data;

[0058] parser executable computer program code to enable the parser to build a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one chain of context steps, the code being configured to:

[0059] detect if according to the grammar syntax a token is allowable in the context; and

[0060] if the token is allowable, to create a new context step corresponding to the token;

[0061] characterised in that the computer program product further comprises for recovering an unallowable token:

[0062] parser executable computer program code to enable the parser to identify for the unallowable token a suitable context in which the token is allowable; and

[0063] parser executable computer program code to enable the parser to apply the token in the identified suitable context.

[0064] According to a fifth aspect of the invention there is provided a system comprising a mobile telecommunications network and a gateway having a parser that comprises:

[0065] an input for receiving input data in a first computer language;

[0066] a processor for generating a plurality of tokens according to the input data;

[0067] a context builder for building a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one chain of context steps, configured to:

[0068] detect if according to the grammar syntax a token is allowable in the context; and

[0069] if the token is allowable, to create a new context step corresponding to the token;

[0070] characterised in that the parser further comprises for recovering an unallowable token:

[0071] an identifying block for identifying for the unallowable token a suitable context in which the token is allowable; and

[0072] the processor is configured to apply the token in the identified suitable context.

[0073] The invention will now be described, by way of example only, with reference to the accompanying drawings, in which

[0074] FIG. 1 shows a set of HTML definitions defining an exemplary structure;

[0075] FIG. 2 shows a diagram of the structure of FIG. 1;

[0076] FIG. 3 shows a generic timing chart of parsing of structured data in a sequential order according to the prior art;

[0077] FIG. 4 shows a generic timing chart of parsing of structured data in a partly parallel order according to a preferred embodiment of the invention;

[0078] FIG. 5 shows an example of generic correction of a token according to the preferred embodiment;

[0079] FIG. 6 shows a flow chart of a recovery process to define a correct context for an erratic token according to the preferred embodiment;

[0080] FIG. 7 shows a system according to the preferred embodiment of the invention; and

[0081] FIG. 8 shows a block diagram of a parser of the system shown in FIG. 7.

[0082] FIG. 1 shows a set of HTML definitions, or (markup) tokens 10, for defining the structure of an exemplary HTML document. Each token is enclosed by < and > marks or by quotation marks. The tokens surrounded by < and > marks usually contain information concerning a structure of data to form a context to user data, whereas the tokens enclosed by quotation marks represent the user data concerning a context. In the structure of FIG. 1, a sequence of tokens contains a first list item token LI 10, a first end of list item token /LI 11, a second list item token LI 12 and a second end of list item token /LI 14.

[0083] In HTML, a document is arranged in a tree-like structure by definitions determining the various branches of the tree. FIG. 2 shows a tree-like view of the structure of the document whose tokens are shown in FIG. 1. For example, a token HTML 2 represents the highest level of the tree-like structure, and tokens HEAD 2.1 and BODY 2.2 represent two main branches. The token HEAD 2.1 defines the header of the document, comprising again two branches, TITLE 2.1.1 for defining the title of the header and META 2.1.2 describing the document in various ways. At the bottom of the header branch, under the TITLE 2.1.1, there is a token 2.1.1.1 containing a string of text “EXAMPLE TEXT”, which is to be used as the header. The second main branch divides into tokens 2.2.1 H1 and 2.2.2 OL. H1 has one subordinate token 2.2.1.1 HEADER. OL 2.2.2 has two subordinate tokens, 2.2.2.1 and 2.2.2.2, each representing one list item LI in an ordered list OL. The list items (tokens 2.2.2.1 and 2.2.2.2) have respective subordinate tokens defining the content of the list items. In this example, the text of the first list item 2.2.2.1.1 is SEC and the text of the second list item 2.2.2.2.1 is FIRST.

[0084] It is helpful in understanding the invention to realise that structured languages such as HTML and XML give rise to a tree-like structure, where the most significant definitions are on the first hierarchy levels as opposed to the least significant definitions at the ends of the branches of the tree. For example, the token 2.2.2.2.1 only affects one list item, but the token 2.2 defines the start of the entire body containing the ordered list OL.

[0085] The structured hierarchy of HTML documents has been utilised by arranging the steps required for translation of them into other languages, as is shown in FIG. 3. If an HTML document is to be translated into another language or into a dialect of HTML, the tokens are first generated in a tokenising phase TOK. After tokenising there is a parsing phase PAR, in which the tokens are parsed to generate a structure. A translation phase TRA follows, in which the structure can be described by the destination language or dialect. After the translation phase TRA is completed, the result can be transmitted over a communications channel in a transmitting phase COM. This approach is easy to manage and provides an opportunity for enhanced robustness by allowing the tokens to be parsed regardless of their order. However, one disadvantage of the process is that the amount of time it takes can be as much as the sum of all its phases. To overcome this problem, there is an alternative translation process that utilises the structure of the source and destination languages to accelerate the parsing, as shown in FIG. 4.

[0086] FIG. 4 shows a timing chart of parsing of structured data in a parallel fashion, according to a preferred embodiment of the invention. The process differs from the process shown in FIG. 3 in that soon after the tokenising has started, the subsequent phases are started whilst the tokenising is progressing. During tokenising, the phases are started in the order of the parsing phase PAR, the translation phase TRA and the transmitting phase COM. Thus, all of these phases occur in parallel. In this example, the transmitting is started before the end of the tokenising. These processes can be carried out partially in parallel since each branch of the tree-like structure can be processed as a separate unit.
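The overlap of phases can be illustrated with chained Python generators. This is a swapped-in technique chosen for brevity: generators interleave the phases on a single thread rather than running them truly in parallel, but they show the same effect of transmitting beginning before tokenising ends. The phase functions, the `LOG` list and the lower-case output format are all assumptions made for illustration, and only tag tokens are handled.

```python
from typing import Iterable, Iterator, List, Tuple

LOG: List[str] = []  # records the order in which the phases do their work


def tokenise(source: Iterable[str]) -> Iterator[str]:
    """Phase TOK: hand over one token at a time."""
    for raw in source:
        LOG.append(f"TOK {raw}")
        yield raw


def parse(tokens: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Phase PAR: attach a nesting depth to each tag token."""
    depth = 0
    for tok in tokens:
        if tok.startswith("/"):
            depth -= 1
        LOG.append(f"PAR {tok}")
        yield (depth, tok)
        if not tok.startswith("/"):
            depth += 1


def translate(nodes: Iterable[Tuple[int, str]]) -> Iterator[str]:
    """Phase TRA: emit an indented lower-case tag (hypothetical target form)."""
    for depth, tok in nodes:
        LOG.append(f"TRA {tok}")
        yield "  " * depth + f"<{tok.lower()}>"


def transmit(lines: Iterable[str]) -> List[str]:
    """Phase COM: collect what would be sent over the channel."""
    sent = []
    for line in lines:
        LOG.append(f"COM {line.strip()}")
        sent.append(line)
    return sent


sent = transmit(translate(parse(tokenise(["OL", "LI", "/LI", "/OL"]))))
```

After the call, `LOG` shows the first token passing through all four phases before the second token is even tokenised, in contrast to the strictly sequential order of FIG. 3.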

[0087] The error recovery process will now be described. Referring back to FIG. 1, lack of the first end of list item token 11 would cause an error in the parsing of the tokens. Strictly speaking, the second list item token 12 may not be subordinate to the first list item 10. In this particular example, it can be appreciated that appearance of a new list item should automatically cause parsing to close the previous list item branch. After closure of the previous list item branch, a new branch should be created for the new list item. Such induction enables recovery from at least some errors. Two types of correction methods are described next. In a first type of method, correction is based on a generic feature of the language being interpreted. In a second type of method, correction is based on a rule specific to the type of token. The preferred embodiment of the invention comprises capability for both types of methods. They will now be explained referring to corresponding FIGS. 5 and 6.

[0088] FIG. 5 shows an example of generic correction of a token according to the preferred embodiment of the invention. In this case, a sequence of tokens HTML 50, BODY 52, Ordered List token OL 54, List Item token LI 56, Ordered List token OL 58 and an image token IMG 59 is received in that order, but the last token IMG 59 mismatches with the preceding token OL 58. Applicability of each token to match with that last token IMG 59 is tested to find the likely correct position for the mismatched last token 59. In this example, the last token IMG 59 is an image token that, according to grammar rules, cannot be subordinate to the LI 56. Neither can it directly follow the OL 54 that precedes the LI 56. Instead, it can be subordinate to token BODY 52 that precedes the OL 54. Therefore, this position is found in an efficient processing manner by proceeding backwards along the parsed structure. Most frequently, the correct position of an unallowable token is close to the end at which it is located. By using a reverse procedure (proceeding backwards along a branch) it is possible to find the most likely correct position with a minor amount of processing. Such a procedure also causes the smallest possible (erroneous) change in the result because the changes are primarily carried out on the least significant parts of the context (farthest from the root of the tree-like structure).
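The FIG. 5 walk-through can be reproduced with a short sketch. The `ALLOWED` table below is a hypothetical rule set chosen only so that it matches the narrative above (IMG is allowable under BODY and nowhere else in the chain); it is not a faithful HTML grammar, and `find_position` is an illustrative name.

```python
from typing import Dict, List, Set

# Hypothetical rules matching the FIG. 5 narrative.
ALLOWED: Dict[str, Set[str]] = {
    "HTML": {"BODY"},
    "BODY": {"OL", "IMG"},
    "OL": {"LI"},
    "LI": {"OL"},
}


def find_position(chain: List[str], token: str) -> int:
    """Scan from the end of the chain towards the root and return the
    index of the first step under which the token is allowable."""
    for index in range(len(chain) - 1, -1, -1):
        if token in ALLOWED.get(chain[index], set()):
            return index
    raise ValueError(f"{token} is not allowable anywhere in the chain")


chain = ["HTML", "BODY", "OL", "LI", "OL"]
index = find_position(chain, "IMG")
recovered = chain[: index + 1] + ["IMG"]
```

Here `index` identifies the BODY step, so the recovered chain is HTML, BODY, IMG; the steps discarded are those farthest from the root.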

[0089] FIG. 6 shows a rule-specific recovery process to define a correct context for an erroneous token, according to the preferred embodiment of the invention. In this case, a sequence of tokens HTML 62, BODY 63 and LI 64 is received. The LI 64 cannot be subordinate to a BODY token. Therefore, a rule-specific correction method is used to add a proper preceding token, in this case an unordered list token UL 66. Typically, there is an additional test of whether such an added token is appropriate considering the token to which it becomes subordinate, in this case the BODY 63. If this test succeeds, the added token is deemed to be valid. The LI 64 is indeed a valid subordinate to the UL 66 and the test succeeds. As a result, the new sequence of tokens is the HTML 62, the BODY 63, the UL 66 and the LI 64. In this case, the LI 64 can be subordinate to only one type of token, UL. Hence, the UL 66 is automatically added.
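A minimal sketch of this token-specific rule, including the additional validity test, might look as follows. The `IMPLIED_PARENT` table, the `ALLOWED` rules and the function name are assumptions introduced for illustration; the specification does not prescribe a data structure.

```python
from typing import Dict, List, Optional, Set

# Hypothetical grammar: parent token -> tokens allowed directly beneath it.
ALLOWED: Dict[str, Set[str]] = {
    "HTML": {"BODY"},
    "BODY": {"UL", "H1"},
    "UL": {"LI"},
}

# Hypothetical token-specific rule: the parent to insert for an orphan token.
IMPLIED_PARENT: Dict[str, str] = {"LI": "UL"}


def insert_implied_parent(chain: List[str], token: str) -> Optional[List[str]]:
    """Return the repaired chain, or None if the rule does not apply."""
    parent = IMPLIED_PARENT.get(token)
    if parent is None:
        return None  # no token-specific rule exists for this token
    # Additional test: the added token must itself be allowable under
    # the current end of the chain.
    if parent not in ALLOWED.get(chain[-1], set()):
        return None
    return chain + [parent, token]
```

Calling `insert_implied_parent(["HTML", "BODY"], "LI")` yields the sequence HTML, BODY, UL, LI described above, while the same call on a chain ending in HTML fails the additional test and returns `None`.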

[0090] In an embodiment of the invention, it is detected whether the appropriate token is the only alternative. If other alternatives exist, a heuristic analysis is used to determine the most likely alternative that should be used. For instance, the parser may maintain a table of probabilities for each possible different series of subsequent tokens. Then, the parser selects such a token to be added that in series with a pre-existing adjacent token has the greatest probability. In yet another and simpler embodiment, instead of heuristic analysis, the parser picks an additional token substantially randomly within the group of allowable tokens (alternatives).
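Both selection strategies can be sketched briefly. The probability values and the candidate tokens below are invented for illustration only, and the function names are assumptions; the specification does not state how the table is populated.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical table: probability that the second token follows the first.
PAIR_PROBABILITY: Dict[Tuple[str, str], float] = {
    ("BODY", "UL"): 0.6,
    ("BODY", "OL"): 0.3,
    ("BODY", "MENU"): 0.1,
}


def most_likely(previous: str, candidates: List[str]) -> str:
    """Pick the candidate with the greatest probability after `previous`.
    Pairs missing from the table are treated as probability 0."""
    return max(candidates, key=lambda c: PAIR_PROBABILITY.get((previous, c), 0.0))


def pick_randomly(candidates: List[str]) -> str:
    """The simpler embodiment: any allowable candidate will do."""
    return random.choice(candidates)
```

The heuristic variant gives repeatable results for the same input, which may matter when the output of a gateway is cached; the random variant needs no table at all.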

[0091] If the added token is not appropriate, the added token may be simply abandoned. A more intelligent approach is to attempt adding another token between the BODY 63 and the added token in order to end up with a valid branch. This allows recovery even in cases where more than one token is missing, but the missing tokens can be reasonably deduced from the context.
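One possible way to deduce several missing tokens is a breadth-first search over the grammar for the shortest chain of steps that makes the token allowable. Breadth-first search is a choice made for this sketch, not something the specification prescribes; the `ALLOWED` table, the `missing_chain` name and the `max_inserted` limit are likewise hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set, Tuple

# Hypothetical grammar: parent token -> tokens allowed directly beneath it.
ALLOWED: Dict[str, Set[str]] = {
    "HTML": {"HEAD", "BODY"},
    "BODY": {"TABLE", "UL"},
    "TABLE": {"TR"},
    "TR": {"TD"},
    "UL": {"LI"},
}


def missing_chain(parent: str, token: str, max_inserted: int = 3) -> Optional[List[str]]:
    """Return the shortest list of tokens to insert between parent and
    token so that every step is allowable, or None if none is found."""
    queue: Deque[Tuple[str, List[str]]] = deque([(parent, [])])
    seen = {parent}
    while queue:
        current, inserted = queue.popleft()
        if token in ALLOWED.get(current, set()):
            return inserted
        if len(inserted) == max_inserted:
            continue  # do not insert more tokens than the limit allows
        for child in sorted(ALLOWED.get(current, set())):
            if child not in seen:
                seen.add(child)
                queue.append((child, inserted + [child]))
    return None
```

For a table cell TD arriving directly under BODY, the search proposes inserting TABLE and then TR. The limit keeps the search cheap and avoids inventing long chains that are unlikely to reflect what the author intended.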

[0092] FIG. 7 shows a system according to the preferred embodiment of the invention. The system comprises two data communications networks: a GSM telephone network 72 and an IP (Internet Protocol) network 71, such as the Internet or an intranet. The two data communications networks are linked by a gateway 70. The gateway 70 comprises a first parser 73 (for HTML) and a second parser 74 (for WML) for translating data (content) between HTML and WML. A terminal 75 is coupled to the GSM telephone network 72. The terminal is a dual-mode terminal such as a laptop personal computer PC having a GSM module. In the preferred embodiment, the same terminal also has functionality for connecting to the IP network 71. The terminal comprises a parser 76 for processing WML documents.

[0093] FIG. 8 shows a block diagram of the first parser 73 of FIG. 7. The first parser has a processor μP, an input IN for receiving data to the processor μP, an output OUT for outputting data from the processor μP and thus from the parser 73 to outside of the parser 73, parser software P1, memory MEM, and a database DB of parsing rules comprising two different sets of parsing rules R1 and R2.

[0094] In operation, the processor μP receives a set of tokens in an input language, in this case in HTML. The processor μP determines the language and/or dialect of the input language and retrieves a corresponding set of parsing rules from the database DB. Using the parser software P1 and the memory MEM, the processor μP generates a structure according to the received set of tokens. Then, the processor μP sends the structure to the second parser 74 so that the second parser 74 can generate a new set of tokens, in an output language, according to its parsing rules.

[0095] The structure of the second parser may correspond to that of the first parser. In the preferred embodiment, the second parser 74 also has a database comprising two or more sets of parsing and translation rules so that it adapts to produce a correct language or dialect.

[0096] Using the two parsers 73 and 74, the gateway 70 can translate data from one markup language to another. The gateway 70 may adapt dynamically to different languages it processes. The possibility to adapt the second parser to different languages also allows use of one gateway for serving various different types of networks.

[0097] According to the preferred embodiment, the database DB is updated dynamically on demand. For example, the processor may recognise an input language or dialect for which it does not have a set of rules. In this case, the gateway may have a connection to another rule set database (that may be centrally managed) and retrieve a proper set of rules to its own database. In this way, the parser can automatically keep itself up to date without any manual reconfiguration.
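The on-demand update can be sketched as a local rule store with a fallback to a remote source. Everything named here is hypothetical: the `RuleDatabase` class, the `fetch_remote` callable and the in-memory `CENTRAL` dictionary merely stand in for the database DB and the centrally managed rule set database, and a real gateway would replace the callable with a network request.

```python
from typing import Callable, Dict, List, Optional, Set

RuleSet = Dict[str, Set[str]]  # parent token -> allowed child tokens


class RuleDatabase:
    """Local rule store that falls back to a remote source on a miss."""

    def __init__(self, local: Dict[str, RuleSet],
                 fetch_remote: Callable[[str], Optional[RuleSet]]) -> None:
        self._local = dict(local)
        self._fetch_remote = fetch_remote

    def rules_for(self, language: str) -> RuleSet:
        if language not in self._local:
            fetched = self._fetch_remote(language)
            if fetched is None:
                raise KeyError(f"no rule set available for {language}")
            self._local[language] = fetched  # keep for later requests
        return self._local[language]


# Stand-in for the centrally managed database (an assumption of this sketch).
CENTRAL: Dict[str, RuleSet] = {"WML": {"WML": {"CARD"}, "CARD": {"P"}}}
fetch_log: List[str] = []


def fetch_from_central(language: str) -> Optional[RuleSet]:
    fetch_log.append(language)
    return CENTRAL.get(language)


db = RuleDatabase({"HTML": {"HTML": {"HEAD", "BODY"}}}, fetch_from_central)
```

A request for HTML rules is served locally, whereas the first request for WML rules triggers one fetch and later requests reuse the stored copy.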

[0098] Particular implementations and embodiments of the invention have been described. It is clear to a person skilled in the art that the invention is not restricted to details of the embodiments presented above, but that it can be implemented in other embodiments using equivalent means without deviating from the characteristics of the invention. The scope of the invention is only restricted by the attached patent claims.

1. A method for parsing structured data comprising the steps of: receiving input data in a first computer language; generating a plurality of tokens according to the input data; building a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one or more chains of context steps, the step of building the context comprising the sub-steps of: detecting if according to the grammar syntax a token is allowable in the context; and if the token is allowable, creating a new context step corresponding to the token; characterised in that the method comprises the further steps for recovering an unallowable token: identifying a suitable context for the unallowable token in which context the token is allowable; and applying the token in the identified suitable context.
2. A method according to claim 1, wherein the identifying of the suitable context comprises searching for a least different allowable context for the token using the grammar syntax.
3. A method according to claim 2, wherein the applying the token comprises modifying the context to be equal to the least different allowable context.
4. A method according to claim 2, wherein the searching for the least different allowable context attempts to generate a new context step according to a rule specific to the unallowable token to complement the context so that the unallowable token becomes allowable.
5. A method according to any of the preceding claims, wherein the method first attempts to complement the context by adding such a context step, which is allowable, according to the syntax, both to the preceding context step and to the unallowable context step.
6. A method according to claim 5, wherein, if such complementing fails, a backwards proceeding searching for the suitable context is performed.
7. A method according to any of the preceding claims, wherein at least one of a rule specific for the generic syntax of the first computer language and a type of token is used for determining whether a token is allowable in a context.
8. A method according to any of the preceding claims, wherein the method comprises a further step of dynamically switching to use another grammar syntax to adapt to a different language or to a different dialect of the language.
9. A parser comprising: an input for receiving input data in a first computer language; a tokeniser for generating a plurality of tokens according to the input data; a context builder for building a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one chain of context steps, the context builder being configured to: detect if according to the grammar syntax a token is allowable in the context; and if the token is allowable, to create a new context step corresponding to the token; characterised in that the parser further comprises for recovering an unallowable token: an identifying block for identifying for the unallowable token a suitable context in which the token is allowable; and the tokeniser is configured to apply the token in the identified suitable context.
10. A computer program product for controlling a parser comprising: parser executable computer program code to enable the parser to receive input data in a first computer language; parser executable computer program code to enable the parser to generate a plurality of tokens according to the input data; parser executable computer program code to enable the parser to build a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one chain of context steps, the code being configured to: detect if according to the grammar syntax a token is allowable in the context; and if the token is allowable, to create a new context step corresponding to the token; characterised in that the computer program product further comprises for recovering an unallowable token: parser executable computer program code to enable the parser to identify for the unallowable token a suitable context in which the token is allowable; and parser executable computer program code to enable the parser to apply the token in the identified suitable context.
11. A processing unit having a parser comprising: an input for receiving input data in a first computer language; a processor for generating a plurality of tokens according to the input data; a context builder for building a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one chain of context steps, configured to: detect if according to the grammar syntax a token is allowable in the context; and if the token is allowable, to create a new context step corresponding to the token; characterised in that the parser further comprises for recovering an unallowable token: an identifying block for identifying for the unallowable token a suitable context in which the token is allowable; and the processor is configured to apply the token in the identified suitable context.
12. A processing unit according to claim 11, comprising a device selected from a group consisting of: a translator, a gateway, a mobile station and a web server.
13. A system comprising a mobile telecommunications network and a gateway having a parser that comprises: an input for receiving input data in a first computer language; a processor for generating a plurality of tokens according to the input data; a context builder for building a context by using a grammar syntax comprising a set of rules, the context comprising a plurality of context steps in the form of at least one chain of context steps, configured to: detect if according to the grammar syntax a token is allowable in the context; and if the token is allowable, to create a new context step corresponding to the token; characterised in that the parser further comprises for recovering an unallowable token: an identifying block for identifying for the unallowable token a suitable context in which the token is allowable; and the processor is configured to apply the token in the identified suitable context.